{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### BeautifulSoup\n",
    "\n",
    "简单来说，Beautiful Soup 是 python 的一个库，最主要的功能是从网页抓取数据。官方解释如下：\n",
    "\n",
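Beautiful Soup 提供">
    ">Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract, and because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it, in which case you only need to state the original encoding. Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, letting you try different parsing strategies or trade speed for flexibility.\n",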
    "\n",
    "Beautiful Soup 3 目前已经停止开发，推荐在现在的项目中使用 Beautiful Soup 4，不过它已经被移植到 BS4 了，也就是说导入时我们需要 import bs4 。所以这里我们用的版本是 Beautiful Soup 4.3.2 (简称 BS4)，\n",
    "\n",
    ">可以利用 pip 来安装.\n",
    "\n",
    "``pip install beautifulsoup4``\n",
    "\n",
    "Beautiful Soup 支持 Python 标准库中的 HTML 解析器，还支持一些第三方的解析器，如果我们不安装它，则 Python 会使用 Python 默认的解析器，而lxml 解析器更加强大，速度更快，推荐安装。\n",
    "\n",
    "\n",
    "lxml HTML 解析器 [教程一](https://lxml.de/)， [教程二](https://www.jianshu.com/p/8f6917e4e6dd)\n",
    "\n",
    "``\n",
    "BeautifulSoup(markup, 'lxml')\n",
    "``\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  官方文档\n",
    "\n",
    "在这里先分享官方文档链接，不过内容是有些多，但是是最权威的，而且在实时更新。建议大家收藏\n",
    "\n",
    "https://beautifulsoup.readthedocs.io/zh_CN/latest/\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 第一步：创建BeautifulSoup对象 \n",
    "\n",
    "导入我们需要的库 bs4 \n",
    "\n",
    "然后创建一个文档，从新浪财经的网站上面截取的一部分\n",
    "\n",
    "\\assets\\html.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(open('html.html','r',encoding='utf-8'), 'lxml')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n",
      "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>\n",
      "\n",
      "​\t\n",
      "\t\t\t</p><title>个股点评_证券_新浪财经</title>\n",
      "<meta content=\"个股点评_证券_新浪财经\" name=\"Keywords\"/>\n",
      "<div class=\"hs01\"> </div>\n",
      "<ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml\" target=\"_blank\">*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产</a><span>(11月22日 07:17)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznctke2383291.shtml\" target=\"_blank\">拉尼娜来袭 一文看清相关行业投资机会（附股）</a><span>(11月20日 11:25)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznezxs2782833.shtml\" target=\"_blank\">暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？</a><span>(11月20日 08:55)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-19/doc-iiznezxs2717397.shtml\" target=\"_blank\">2020年11月20日涨停板早知道：七大利好有望发酵</a><span>(11月19日 20:05)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznctke2237870.shtml\" target=\"_blank\">三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）</a><span>(11月19日 14:22)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznezxs2652793.shtml\" target=\"_blank\">军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大</a><span>(11月19日 13:33)</span></li> <li><a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznctke2194128.shtml\" target=\"_blank\">国常会再提促进家电消费：家电股迎政策红利 两条主线布局</a><span>(11月19日 10:29)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznezxs2611976.shtml\" target=\"_blank\">涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了</a><span>(11月19日 09:45)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-18/doc-iiznezxs2537290.shtml\" target=\"_blank\">2020年11月19日涨停板早知道：七大利好有望发酵</a><span>(11月18日 19:26)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/s/2020-11-18/doc-iiznezxs2427923.shtml\" target=\"_blank\">前三季中国拿下世界造船业半数订单 成全球重要造船中心(股)</a><span>(11月18日 09:28)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-18/doc-iiznctke2000118.shtml\" target=\"_blank\">手机摄像头出货量回暖：多摄趋势加速渗透 产业链有望持续受益</a><span>(11月18日 08:58)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-17/doc-iiznctke1936558.shtml\" target=\"_blank\">2020年11月18日涨停板早知道：七大利好有望发酵</a><span>(11月17日 19:36)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819613.shtml\" target=\"_blank\">能源工业云网正式发布 赋能能源产业链(附股)</a><span>(11月17日 09:18)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznezxs2241938.shtml\" target=\"_blank\">10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)</a><span>(11月17日 08:53)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819184.shtml\" target=\"_blank\">全球首款定制网约车来了：滴滴出行携手比亚迪 概念股站上风口</a><span>(11月17日 08:53)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-16/doc-iiznctke1746505.shtml\" target=\"_blank\">2020年11月17日涨停板早知道：七大利好有望发酵</a><span>(11月16日 19:07)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1667121.shtml\" target=\"_blank\">有色板块多股涨停：电解铝、稀土价格有望持续修复反弹(附股)</a><span>(11月16日 11:33)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2076170.shtml\" target=\"_blank\">医美板块大涨：疫情趋稳需求恢复 三条赛道布局医疗美容(股)</a><span>(11月16日 10:48)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1638867.shtml\" target=\"_blank\">疫苗超低温冰柜脱销 冷链板块有望重返高光时刻？(名单)</a><span>(11月16日 08:55)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2058212.shtml\" target=\"_blank\">全球最大自贸协定达成：零关税产品超90% 概念股名单来了</a><span>(11月16日 08:46)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-15/doc-iiznctke1577782.shtml\" target=\"_blank\">2020年11月16日涨停板早知道：七大利好有望发酵</a><span>(11月15日 19:36)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1746412.shtml\" target=\"_blank\">政策暖风频吹：多地抓紧布局 关注燃料电池产业链</a><span>(11月13日 21:36)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1731872.shtml\" target=\"_blank\">车联网细分赛道迎重大风口：板块概念股名单来了</a><span>(11月13日 19:37)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1713560.shtml\" target=\"_blank\">旺季开锣：冰雪旅游预订量飙涨300倍 相关概念股全梳理</a><span>(11月13日 17:32)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-12/doc-iiznezxs1532374.shtml\" target=\"_blank\">2020年11月13日涨停板早知道：七大利好有望发酵</a><span>(11月12日 19:27)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznctke0924923.shtml\" target=\"_blank\">顺周期概念股全面爆发 还有哪些板块可挖掘？</a><span>(11月11日 21:45)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-11/doc-iiznctke0909075.shtml\" target=\"_blank\">2020年11月12日涨停板早知道：七大利好有望发酵</a><span>(11月11日 19:35)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznezxs1281954.shtml\" target=\"_blank\">江苏省发布区块链产业发展计划 相关应用有望提速(附股)</a><span>(11月11日 15:04)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-10/doc-iiznezxs1113487.shtml\" target=\"_blank\">2020年11月11日涨停板早知道：七大利好有望发酵</a><span>(11月10日 19:09)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznezxs0925830.shtml\" target=\"_blank\">辉瑞新冠疫苗有效性超90% 这些疫苗概念股可关注(附股)</a><span>(11月09日 22:42)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-09/doc-iiznezxs0901497.shtml\" target=\"_blank\">2020年11月10日涨停板早知道：七大利好有望发酵</a><span>(11月09日 19:33)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznctke0462391.shtml\" target=\"_blank\">库存去化顺畅：下游汽车家电需求旺盛 钢市春天要来了？</a><span>(11月09日 17:36)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/e/2020-11-08/doc-iiznezxs0691123.shtml\" target=\"_blank\">2020年11月9日涨停板早知道：七大利好有望发酵</a><span>(11月08日 17:58)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0312967.shtml\" target=\"_blank\">航运行业持续高景气 机构：集装箱紧缺状态至少持续半年</a><span>(11月06日 15:07)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9882863.shtml\" target=\"_blank\">冠脉支架集采结果出炉：中标价大幅下降 相关公司股价承压(附股)</a><span>(11月06日 14:25)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0307628.shtml\" target=\"_blank\">“智慧停车”朋友圈再扩容：行业已被资本瞄准 概念股一网打尽</a><span>(11月06日 14:25)</span></li> <li><a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznctkc9874946.shtml\" target=\"_blank\">钢铁股逆市走高 机构建议关注特钢龙头（附股）</a><span>(11月06日 13:53)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0290017.shtml\" target=\"_blank\">广东到2035年通用机场服务将覆盖所有县 相关产业链公司受关注</a><span>(11月06日 13:40)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9839653.shtml\" target=\"_blank\">国信证券：地产估值已经处于短周期底部 推荐5股</a><span>(11月06日 11:14)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9837908.shtml\" target=\"_blank\">国信证券：冠脉支架集采中标价大降 短期利润空间受影响</a><span>(11月06日 11:01)</span></li>\n",
      "</ul> <ul class=\"list_009\">\n",
      "<li><a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0255078.shtml\" target=\"_blank\">美元大跌黄金大涨：相关概念股集体躁动 机构推荐5股</a><span>(11月06日 11:00)</span></li> <li><a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0254884.shtml\" target=\"_blank\">券商股走强：全面实行注册制号角吹响 把握改革红利(附股)</a><span>(11月06日 10:59)</span></li> <li><a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0239160.shtml\" target=\"_blank\">券商板块强势拉升：国金证券一度涨停 中金公司连续大涨</a><span>(11月06日 10:03)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819556.shtml\" target=\"_blank\">任天堂营业利润大增超两倍 switch成为最畅销游戏机(附股)</a><span>(11月06日 09:54)</span></li> <li><a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819038.shtml\" target=\"_blank\">多家车企10月销量增势明显 关注这些细分领域个股</a><span>(11月06日 09:53)</span></li>\n",
      "</ul><div class=\"hs01\"> </div>\n",
      "<table cellspacing=\"0\" style=\"margin:0 auto;\">\n",
      "<tbody>\n",
      "<tr>\n",
      "<td>\n",
      "<span class=\"pagebox\">\n",
      "<span class=\"pagebox_pre_nolink\">上一页</span>\n",
      "<span class=\"pagebox_num_nonce\">1</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">2</a>\n",
      "</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=3\">3</a>\n",
      "</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=4\">4</a>\n",
      "</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=5\">5</a>\n",
      "</span>\n",
      "<span class=\"pagebox_next\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">下一页</a></span>\n",
      "</span>\n",
      "</td>\n",
      "</tr>\n",
      "</tbody>\n",
      "</table><!-- 分页 end -->\n",
      "\n",
      "\n",
      "\t\n",
      "\n",
      "\n",
      "​\t\t        \n",
      "​\t\n",
      "\n",
      "\n",
      "</body>\n",
      "</html>\n"
     ]
    }
   ],
   "source": [
    "print(soup)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n",
      "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n",
      " <body>\n",
      "  <p>\n",
      "   ​\n",
      "  </p>\n",
      "  <title>\n",
      "   个股点评_证券_新浪财经\n",
      "  </title>\n",
      "  <meta content=\"个股点评_证券_新浪财经\" name=\"Keywords\"/>\n",
      "  <div class=\"hs01\">\n",
      "  </div>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml\" target=\"_blank\">\n",
      "     *ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月22日 07:17)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznctke2383291.shtml\" target=\"_blank\">\n",
      "     拉尼娜来袭 一文看清相关行业投资机会（附股）\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月20日 11:25)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznezxs2782833.shtml\" target=\"_blank\">\n",
      "     暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月20日 08:55)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-19/doc-iiznezxs2717397.shtml\" target=\"_blank\">\n",
      "     2020年11月20日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月19日 20:05)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznctke2237870.shtml\" target=\"_blank\">\n",
      "     三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月19日 14:22)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznezxs2652793.shtml\" target=\"_blank\">\n",
      "     军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月19日 13:33)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznctke2194128.shtml\" target=\"_blank\">\n",
      "     国常会再提促进家电消费：家电股迎政策红利 两条主线布局\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月19日 10:29)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznezxs2611976.shtml\" target=\"_blank\">\n",
      "     涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月19日 09:45)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-18/doc-iiznezxs2537290.shtml\" target=\"_blank\">\n",
      "     2020年11月19日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月18日 19:26)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/s/2020-11-18/doc-iiznezxs2427923.shtml\" target=\"_blank\">\n",
      "     前三季中国拿下世界造船业半数订单 成全球重要造船中心(股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月18日 09:28)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-18/doc-iiznctke2000118.shtml\" target=\"_blank\">\n",
      "     手机摄像头出货量回暖：多摄趋势加速渗透 产业链有望持续受益\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月18日 08:58)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-17/doc-iiznctke1936558.shtml\" target=\"_blank\">\n",
      "     2020年11月18日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月17日 19:36)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819613.shtml\" target=\"_blank\">\n",
      "     能源工业云网正式发布 赋能能源产业链(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月17日 09:18)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznezxs2241938.shtml\" target=\"_blank\">\n",
      "     10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月17日 08:53)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819184.shtml\" target=\"_blank\">\n",
      "     全球首款定制网约车来了：滴滴出行携手比亚迪 概念股站上风口\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月17日 08:53)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-16/doc-iiznctke1746505.shtml\" target=\"_blank\">\n",
      "     2020年11月17日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月16日 19:07)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1667121.shtml\" target=\"_blank\">\n",
      "     有色板块多股涨停：电解铝、稀土价格有望持续修复反弹(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月16日 11:33)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2076170.shtml\" target=\"_blank\">\n",
      "     医美板块大涨：疫情趋稳需求恢复 三条赛道布局医疗美容(股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月16日 10:48)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1638867.shtml\" target=\"_blank\">\n",
      "     疫苗超低温冰柜脱销 冷链板块有望重返高光时刻？(名单)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月16日 08:55)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2058212.shtml\" target=\"_blank\">\n",
      "     全球最大自贸协定达成：零关税产品超90% 概念股名单来了\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月16日 08:46)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-15/doc-iiznctke1577782.shtml\" target=\"_blank\">\n",
      "     2020年11月16日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月15日 19:36)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1746412.shtml\" target=\"_blank\">\n",
      "     政策暖风频吹：多地抓紧布局 关注燃料电池产业链\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月13日 21:36)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1731872.shtml\" target=\"_blank\">\n",
      "     车联网细分赛道迎重大风口：板块概念股名单来了\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月13日 19:37)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1713560.shtml\" target=\"_blank\">\n",
      "     旺季开锣：冰雪旅游预订量飙涨300倍 相关概念股全梳理\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月13日 17:32)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-12/doc-iiznezxs1532374.shtml\" target=\"_blank\">\n",
      "     2020年11月13日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月12日 19:27)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznctke0924923.shtml\" target=\"_blank\">\n",
      "     顺周期概念股全面爆发 还有哪些板块可挖掘？\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月11日 21:45)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-11/doc-iiznctke0909075.shtml\" target=\"_blank\">\n",
      "     2020年11月12日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月11日 19:35)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznezxs1281954.shtml\" target=\"_blank\">\n",
      "     江苏省发布区块链产业发展计划 相关应用有望提速(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月11日 15:04)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-10/doc-iiznezxs1113487.shtml\" target=\"_blank\">\n",
      "     2020年11月11日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月10日 19:09)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznezxs0925830.shtml\" target=\"_blank\">\n",
      "     辉瑞新冠疫苗有效性超90% 这些疫苗概念股可关注(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月09日 22:42)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-09/doc-iiznezxs0901497.shtml\" target=\"_blank\">\n",
      "     2020年11月10日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月09日 19:33)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznctke0462391.shtml\" target=\"_blank\">\n",
      "     库存去化顺畅：下游汽车家电需求旺盛 钢市春天要来了？\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月09日 17:36)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/e/2020-11-08/doc-iiznezxs0691123.shtml\" target=\"_blank\">\n",
      "     2020年11月9日涨停板早知道：七大利好有望发酵\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月08日 17:58)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0312967.shtml\" target=\"_blank\">\n",
      "     航运行业持续高景气 机构：集装箱紧缺状态至少持续半年\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 15:07)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9882863.shtml\" target=\"_blank\">\n",
      "     冠脉支架集采结果出炉：中标价大幅下降 相关公司股价承压(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 14:25)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0307628.shtml\" target=\"_blank\">\n",
      "     “智慧停车”朋友圈再扩容：行业已被资本瞄准 概念股一网打尽\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 14:25)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznctkc9874946.shtml\" target=\"_blank\">\n",
      "     钢铁股逆市走高 机构建议关注特钢龙头（附股）\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 13:53)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0290017.shtml\" target=\"_blank\">\n",
      "     广东到2035年通用机场服务将覆盖所有县 相关产业链公司受关注\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 13:40)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9839653.shtml\" target=\"_blank\">\n",
      "     国信证券：地产估值已经处于短周期底部 推荐5股\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 11:14)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9837908.shtml\" target=\"_blank\">\n",
      "     国信证券：冠脉支架集采中标价大降 短期利润空间受影响\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 11:01)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <ul class=\"list_009\">\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0255078.shtml\" target=\"_blank\">\n",
      "     美元大跌黄金大涨：相关概念股集体躁动 机构推荐5股\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 11:00)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0254884.shtml\" target=\"_blank\">\n",
      "     券商股走强：全面实行注册制号角吹响 把握改革红利(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 10:59)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0239160.shtml\" target=\"_blank\">\n",
      "     券商板块强势拉升：国金证券一度涨停 中金公司连续大涨\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 10:03)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819556.shtml\" target=\"_blank\">\n",
      "     任天堂营业利润大增超两倍 switch成为最畅销游戏机(附股)\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 09:54)\n",
      "    </span>\n",
      "   </li>\n",
      "   <li>\n",
      "    <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819038.shtml\" target=\"_blank\">\n",
      "     多家车企10月销量增势明显 关注这些细分领域个股\n",
      "    </a>\n",
      "    <span>\n",
      "     (11月06日 09:53)\n",
      "    </span>\n",
      "   </li>\n",
      "  </ul>\n",
      "  <div class=\"hs01\">\n",
      "  </div>\n",
      "  <table cellspacing=\"0\" style=\"margin:0 auto;\">\n",
      "   <tbody>\n",
      "    <tr>\n",
      "     <td>\n",
      "      <span class=\"pagebox\">\n",
      "       <span class=\"pagebox_pre_nolink\">\n",
      "        上一页\n",
      "       </span>\n",
      "       <span class=\"pagebox_num_nonce\">\n",
      "        1\n",
      "       </span>\n",
      "       <span class=\"pagebox_num\">\n",
      "        <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">\n",
      "         2\n",
      "        </a>\n",
      "       </span>\n",
      "       <span class=\"pagebox_num\">\n",
      "        <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=3\">\n",
      "         3\n",
      "        </a>\n",
      "       </span>\n",
      "       <span class=\"pagebox_num\">\n",
      "        <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=4\">\n",
      "         4\n",
      "        </a>\n",
      "       </span>\n",
      "       <span class=\"pagebox_num\">\n",
      "        <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=5\">\n",
      "         5\n",
      "        </a>\n",
      "       </span>\n",
      "       <span class=\"pagebox_next\">\n",
      "        <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">\n",
      "         下一页\n",
      "        </a>\n",
      "       </span>\n",
      "      </span>\n",
      "     </td>\n",
      "    </tr>\n",
      "   </tbody>\n",
      "  </table>\n",
      "  <!-- 分页 end -->\n",
      "  ​\t\t        \n",
      "​\n",
      " </body>\n",
      "</html>\n"
     ]
    }
   ],
   "source": [
    "# 我们来打印一下 soup 对象的内容，格式化输出\n",
    "\n",
    "print(soup.prettify())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Beautiful Soup 将复杂 HTML 文档转换成一个复杂的树形结构，每个节点都是 Python 对象\n",
    "\n",
    "我们主要来熟悉一下Tag的对象\n",
    "\n",
    "Tag 是什么？通俗点讲就是 HTML 中的一个个标签，例如\n",
    "\n",
    "``<title>个股点评_证券_新浪财经</title>\n",
    "``\n",
    "\n",
    "``<a href=\"https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml\" target=\"_blank\">*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产</a>\n",
    "``\n",
    "\n",
    "这里面<title>,<a>都是属于标签，利用 Beautiful Soup可以非常方便的将标签中的信息提取出来"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<title>个股点评_证券_新浪财经</title>"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'bs4.element.Tag'>\n"
     ]
    }
   ],
   "source": [
    "# 看一下它是个什么东西 \n",
    "\n",
    "print(type(soup.title))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'title'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 对于 Tag，它有两个重要的属性，是 name 和 attrs\n",
    "\n",
    "soup.title.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'个股点评_证券_新浪财经'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.title.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<a href=\"https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml\" target=\"_blank\">*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产</a>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'href': 'https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml',\n",
       " 'target': '_blank'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.a.attrs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 在这里，我们把 a 标签的所有属性打印输出了出来，得到的类型是一个字典。 如果我们想要单独获取某个属性，可以这样\n",
    "soup.a['href']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 还可以这样，利用 get 方法，传入属性的名称，二者是等价的\n",
    "soup.a.get('href')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
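    "# Attributes and contents can also be modified or deleted, although a scraper rarely needs to.\n",
    "# Such changes only affect the parse tree in memory, not the source file or the remote page.\n",
    "\n",
    "# Modify\n",
    "# soup.a['href']=\"http://sina.com.cn\"\n",
    "\n",
    "# Delete\n",
    "# del soup.a['target']\n"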
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NavigableString**\n",
    "\n",
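    "Now that we have the tag itself, how do we get the text inside it? Use `.string` or `.text`. `.string` returns the tag's only string (or `None` when the tag contains more than one child), while `.text` joins all the strings inside the tag. For example:"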
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.a.string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.a.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'个股点评_证券_新浪财经'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.title.string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
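    "**Traversing the tree: direct children**\n",
    "\n",
    "Key points: the `.contents` and `.children` attributes. `.contents` returns the direct children as a list, and `.children` returns an iterator over the same nodes."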
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.a.contents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "<tbody>\n",
      "<tr>\n",
      "<td>\n",
      "<span class=\"pagebox\">\n",
      "<span class=\"pagebox_pre_nolink\">上一页</span>\n",
      "<span class=\"pagebox_num_nonce\">1</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">2</a>\n",
      "</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=3\">3</a>\n",
      "</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=4\">4</a>\n",
      "</span> <span class=\"pagebox_num\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=5\">5</a>\n",
      "</span>\n",
      "<span class=\"pagebox_next\">\n",
      "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">下一页</a></span>\n",
      "</span>\n",
      "</td>\n",
      "</tr>\n",
      "</tbody>\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for child in soup.table.children:\n",
    "    print(child)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
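    "Besides `.children`, Beautiful Soup can also navigate to parent, descendant, and sibling nodes. Please consult the official documentation or other references to learn about them."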
   ]
  },
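  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point, here is a minimal sketch of parent, sibling, and descendant navigation. The HTML snippet is made up for illustration (it is not taken from the sample page) and is parsed with the standard-library `html.parser` so that it runs without the sample file.\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "from bs4.element import Tag\n",
    "\n",
    "# Hypothetical snippet, used only for this illustration\n",
    "html = '<div><span class=\"a\">first</span><span class=\"b\">second</span></div>'\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "\n",
    "first = soup.find('span', class_='a')\n",
    "second = first.next_sibling            # the node immediately after first\n",
    "\n",
    "print(first.parent.name)               # name of the enclosing tag\n",
    "print(second.get_text())               # text of the following sibling\n",
    "print(second.previous_sibling is first)\n",
    "\n",
    "# .descendants walks children, grandchildren, and so on; keep only the tags\n",
    "descendant_tags = [node.name for node in soup.div.descendants if isinstance(node, Tag)]\n",
    "print(descendant_tags)\n",
    "```\n",
    "\n",
    "On real pages the siblings of a tag are often whitespace strings rather than tags, so `.next_sibling` may return a newline before the tag you expect."
   ]
  },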
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
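    "**Node content: multiple strings**\n",
    "\n",
    "A single string can be read with `.string` or `.text`. When a tag contains several strings, iterate over `.strings`, or over `.stripped_strings` to skip whitespace-only strings and strip surrounding whitespace."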
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<generator object Tag._all_strings at 0x103dd6a50>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.a.strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'\\n'\n",
      "'*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产'\n",
      "'(11月22日 07:17)'\n",
      "' '\n",
      "'拉尼娜来袭 一文看清相关行业投资机会（附股）'\n",
      "'(11月20日 11:25)'\n",
      "' '\n",
      "'暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？'\n",
      "'(11月20日 08:55)'\n",
      "' '\n",
      "'2020年11月20日涨停板早知道：七大利好有望发酵'\n",
      "'(11月19日 20:05)'\n",
      "' '\n",
      "'三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）'\n",
      "'(11月19日 14:22)'\n",
      "'\\n'\n"
     ]
    }
   ],
   "source": [
    "for t in soup.ul.strings:\n",
    "    # repr() shows each string in its literal form, so whitespace-only strings become visible\n",
    "    print(repr(t))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产\n",
      "(11月22日 07:17)\n",
      "拉尼娜来袭 一文看清相关行业投资机会（附股）\n",
      "(11月20日 11:25)\n",
      "暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？\n",
      "(11月20日 08:55)\n",
      "2020年11月20日涨停板早知道：七大利好有望发酵\n",
      "(11月19日 20:05)\n",
      "三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）\n",
      "(11月19日 14:22)\n"
     ]
    }
   ],
   "source": [
    "for t in soup.ul.stripped_strings:\n",
    "    print(t)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "上一页\n",
      "1\n",
      "2\n",
      "3\n",
      "4\n",
      "5\n",
      "下一页\n"
     ]
    }
   ],
   "source": [
    "for t in soup.table.stripped_strings:\n",
    "    print(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
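    "#### Searching the tree\n",
    "\n",
    "A page often contains too much content to walk through node by node, so searching is usually more practical than traversing the tree.\n",
    "\n",
    "For that we use the `.find` family of methods.\n",
    "\n",
    "`find_all()` searches the descendant tags of the current tag and returns those that match the given filters. Its main arguments are described below.\n",
    "\n",
    "1. The `name` argument: finds all tags whose name is `name`. String objects are ignored automatically.\n",
    "\n",
    "**A. Passing a string**"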
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<a href=\"https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml\" target=\"_blank\">*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznctke2383291.shtml\" target=\"_blank\">拉尼娜来袭 一文看清相关行业投资机会（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznezxs2782833.shtml\" target=\"_blank\">暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-19/doc-iiznezxs2717397.shtml\" target=\"_blank\">2020年11月20日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznctke2237870.shtml\" target=\"_blank\">三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznezxs2652793.shtml\" target=\"_blank\">军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznctke2194128.shtml\" target=\"_blank\">国常会再提促进家电消费：家电股迎政策红利 两条主线布局</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznezxs2611976.shtml\" target=\"_blank\">涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-18/doc-iiznezxs2537290.shtml\" target=\"_blank\">2020年11月19日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/s/2020-11-18/doc-iiznezxs2427923.shtml\" target=\"_blank\">前三季中国拿下世界造船业半数订单 成全球重要造船中心(股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-18/doc-iiznctke2000118.shtml\" target=\"_blank\">手机摄像头出货量回暖：多摄趋势加速渗透 产业链有望持续受益</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-17/doc-iiznctke1936558.shtml\" target=\"_blank\">2020年11月18日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819613.shtml\" target=\"_blank\">能源工业云网正式发布 赋能能源产业链(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznezxs2241938.shtml\" target=\"_blank\">10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819184.shtml\" target=\"_blank\">全球首款定制网约车来了：滴滴出行携手比亚迪 概念股站上风口</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-16/doc-iiznctke1746505.shtml\" target=\"_blank\">2020年11月17日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1667121.shtml\" target=\"_blank\">有色板块多股涨停：电解铝、稀土价格有望持续修复反弹(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2076170.shtml\" target=\"_blank\">医美板块大涨：疫情趋稳需求恢复 三条赛道布局医疗美容(股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1638867.shtml\" target=\"_blank\">疫苗超低温冰柜脱销 冷链板块有望重返高光时刻？(名单)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2058212.shtml\" target=\"_blank\">全球最大自贸协定达成：零关税产品超90% 概念股名单来了</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-15/doc-iiznctke1577782.shtml\" target=\"_blank\">2020年11月16日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1746412.shtml\" target=\"_blank\">政策暖风频吹：多地抓紧布局 关注燃料电池产业链</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1731872.shtml\" target=\"_blank\">车联网细分赛道迎重大风口：板块概念股名单来了</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1713560.shtml\" target=\"_blank\">旺季开锣：冰雪旅游预订量飙涨300倍 相关概念股全梳理</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-12/doc-iiznezxs1532374.shtml\" target=\"_blank\">2020年11月13日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznctke0924923.shtml\" target=\"_blank\">顺周期概念股全面爆发 还有哪些板块可挖掘？</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-11/doc-iiznctke0909075.shtml\" target=\"_blank\">2020年11月12日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznezxs1281954.shtml\" target=\"_blank\">江苏省发布区块链产业发展计划 相关应用有望提速(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-10/doc-iiznezxs1113487.shtml\" target=\"_blank\">2020年11月11日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznezxs0925830.shtml\" target=\"_blank\">辉瑞新冠疫苗有效性超90% 这些疫苗概念股可关注(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-09/doc-iiznezxs0901497.shtml\" target=\"_blank\">2020年11月10日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznctke0462391.shtml\" target=\"_blank\">库存去化顺畅：下游汽车家电需求旺盛 钢市春天要来了？</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-08/doc-iiznezxs0691123.shtml\" target=\"_blank\">2020年11月9日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0312967.shtml\" target=\"_blank\">航运行业持续高景气 机构：集装箱紧缺状态至少持续半年</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9882863.shtml\" target=\"_blank\">冠脉支架集采结果出炉：中标价大幅下降 相关公司股价承压(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0307628.shtml\" target=\"_blank\">“智慧停车”朋友圈再扩容：行业已被资本瞄准 概念股一网打尽</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznctkc9874946.shtml\" target=\"_blank\">钢铁股逆市走高 机构建议关注特钢龙头（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0290017.shtml\" target=\"_blank\">广东到2035年通用机场服务将覆盖所有县 相关产业链公司受关注</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9839653.shtml\" target=\"_blank\">国信证券：地产估值已经处于短周期底部 推荐5股</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9837908.shtml\" target=\"_blank\">国信证券：冠脉支架集采中标价大降 短期利润空间受影响</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0255078.shtml\" target=\"_blank\">美元大跌黄金大涨：相关概念股集体躁动 机构推荐5股</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0254884.shtml\" target=\"_blank\">券商股走强：全面实行注册制号角吹响 把握改革红利(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0239160.shtml\" target=\"_blank\">券商板块强势拉升：国金证券一度涨停 中金公司连续大涨</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819556.shtml\" target=\"_blank\">任天堂营业利润大增超两倍 switch成为最畅销游戏机(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819038.shtml\" target=\"_blank\">多家车企10月销量增势明显 关注这些细分领域个股</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">2</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=3\">3</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=4\">4</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=5\">5</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">下一页</a>]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all('a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
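    "**B. Passing a regular expression**\n",
    "\n",
    "If you pass a compiled regular expression, Beautiful Soup filters tag names with its `search()` method. The example below therefore finds every tag whose name contains the letter t, such as `html` and `meta`. To match only names that start with t, use `re.compile(\"^t\")`."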
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "html\n",
      "title\n",
      "meta\n",
      "table\n",
      "tbody\n",
      "tr\n",
      "td\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "# re.compile(\"t\") matches any tag name that contains \"t\"\n",
    "for tag in soup.find_all(re.compile(\"t\")):\n",
    "    print(tag.name)"
   ]
  },
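  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between an unanchored and an anchored pattern can be seen on a small document. This is only a sketch: the HTML string below is invented for the comparison and parsed with `html.parser`, so the tag list differs from the sample page.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Hypothetical document, used only for this comparison\n",
    "html = ('<html><head><title>demo</title></head>'\n",
    "        '<body><table><tr><td>x</td></tr></table></body></html>')\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "\n",
    "# Unanchored pattern: any tag name containing \"t\"\n",
    "contains_t = [tag.name for tag in soup.find_all(re.compile('t'))]\n",
    "# Anchored pattern: only tag names that start with \"t\"\n",
    "starts_with_t = [tag.name for tag in soup.find_all(re.compile('^t'))]\n",
    "\n",
    "print(contains_t)\n",
    "print(starts_with_t)\n",
    "```\n",
    "\n",
    "Only the anchored pattern leaves out `html`, whose name contains a t but does not begin with one."
   ]
  },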
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
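    "2. Keyword arguments\n",
    "\n",
    "Any keyword argument that is not one of `find_all()`'s built-in parameter names is treated as a filter on the tag attribute of that name. Passing an argument named x, for instance, makes Beautiful Soup filter on each tag's x attribute. Some examples follow.\n"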
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<a href=\"https://finance.sina.com.cn/stock/zqgd/2020-11-22/doc-iiznezxs3063378.shtml\" target=\"_blank\">*ST欧浦或面临退市:因公司控股股东佛山市中基投资宣告破产</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznctke2383291.shtml\" target=\"_blank\">拉尼娜来袭 一文看清相关行业投资机会（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-20/doc-iiznezxs2782833.shtml\" target=\"_blank\">暴雪肆虐冷空气“发威”：煤炭供应趋紧 这些厂商躺赢？</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-19/doc-iiznezxs2717397.shtml\" target=\"_blank\">2020年11月20日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznctke2237870.shtml\" target=\"_blank\">三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznezxs2652793.shtml\" target=\"_blank\">军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznctke2194128.shtml\" target=\"_blank\">国常会再提促进家电消费：家电股迎政策红利 两条主线布局</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznezxs2611976.shtml\" target=\"_blank\">涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-18/doc-iiznezxs2537290.shtml\" target=\"_blank\">2020年11月19日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/s/2020-11-18/doc-iiznezxs2427923.shtml\" target=\"_blank\">前三季中国拿下世界造船业半数订单 成全球重要造船中心(股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-18/doc-iiznctke2000118.shtml\" target=\"_blank\">手机摄像头出货量回暖：多摄趋势加速渗透 产业链有望持续受益</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-17/doc-iiznctke1936558.shtml\" target=\"_blank\">2020年11月18日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819613.shtml\" target=\"_blank\">能源工业云网正式发布 赋能能源产业链(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznezxs2241938.shtml\" target=\"_blank\">10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-17/doc-iiznctke1819184.shtml\" target=\"_blank\">全球首款定制网约车来了：滴滴出行携手比亚迪 概念股站上风口</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-16/doc-iiznctke1746505.shtml\" target=\"_blank\">2020年11月17日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1667121.shtml\" target=\"_blank\">有色板块多股涨停：电解铝、稀土价格有望持续修复反弹(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2076170.shtml\" target=\"_blank\">医美板块大涨：疫情趋稳需求恢复 三条赛道布局医疗美容(股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznctke1638867.shtml\" target=\"_blank\">疫苗超低温冰柜脱销 冷链板块有望重返高光时刻？(名单)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-16/doc-iiznezxs2058212.shtml\" target=\"_blank\">全球最大自贸协定达成：零关税产品超90% 概念股名单来了</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-15/doc-iiznctke1577782.shtml\" target=\"_blank\">2020年11月16日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1746412.shtml\" target=\"_blank\">政策暖风频吹：多地抓紧布局 关注燃料电池产业链</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1731872.shtml\" target=\"_blank\">车联网细分赛道迎重大风口：板块概念股名单来了</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-13/doc-iiznezxs1713560.shtml\" target=\"_blank\">旺季开锣：冰雪旅游预订量飙涨300倍 相关概念股全梳理</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-12/doc-iiznezxs1532374.shtml\" target=\"_blank\">2020年11月13日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznctke0924923.shtml\" target=\"_blank\">顺周期概念股全面爆发 还有哪些板块可挖掘？</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-11/doc-iiznctke0909075.shtml\" target=\"_blank\">2020年11月12日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-11/doc-iiznezxs1281954.shtml\" target=\"_blank\">江苏省发布区块链产业发展计划 相关应用有望提速(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-10/doc-iiznezxs1113487.shtml\" target=\"_blank\">2020年11月11日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznezxs0925830.shtml\" target=\"_blank\">辉瑞新冠疫苗有效性超90% 这些疫苗概念股可关注(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-09/doc-iiznezxs0901497.shtml\" target=\"_blank\">2020年11月10日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-09/doc-iiznctke0462391.shtml\" target=\"_blank\">库存去化顺畅：下游汽车家电需求旺盛 钢市春天要来了？</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/e/2020-11-08/doc-iiznezxs0691123.shtml\" target=\"_blank\">2020年11月9日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0312967.shtml\" target=\"_blank\">航运行业持续高景气 机构：集装箱紧缺状态至少持续半年</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9882863.shtml\" target=\"_blank\">冠脉支架集采结果出炉：中标价大幅下降 相关公司股价承压(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0307628.shtml\" target=\"_blank\">“智慧停车”朋友圈再扩容：行业已被资本瞄准 概念股一网打尽</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznctkc9874946.shtml\" target=\"_blank\">钢铁股逆市走高 机构建议关注特钢龙头（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznezxs0290017.shtml\" target=\"_blank\">广东到2035年通用机场服务将覆盖所有县 相关产业链公司受关注</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9839653.shtml\" target=\"_blank\">国信证券：地产估值已经处于短周期底部 推荐5股</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9837908.shtml\" target=\"_blank\">国信证券：冠脉支架集采中标价大降 短期利润空间受影响</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0255078.shtml\" target=\"_blank\">美元大跌黄金大涨：相关概念股集体躁动 机构推荐5股</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0254884.shtml\" target=\"_blank\">券商股走强：全面实行注册制号角吹响 把握改革红利(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-06/doc-iiznezxs0239160.shtml\" target=\"_blank\">券商板块强势拉升：国金证券一度涨停 中金公司连续大涨</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819556.shtml\" target=\"_blank\">任天堂营业利润大增超两倍 switch成为最畅销游戏机(附股)</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-06/doc-iiznctkc9819038.shtml\" target=\"_blank\">多家车企10月销量增势明显 关注这些细分领域个股</a>]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all(target='_blank')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<a href=\"https://finance.sina.com.cn/stock/e/2020-11-19/doc-iiznezxs2717397.shtml\" target=\"_blank\">2020年11月20日涨停板早知道：七大利好有望发酵</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznctke2237870.shtml\" target=\"_blank\">三大运营商或于年底宣布5G消息商用 产业链标的有望受益（附股）</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznezxs2652793.shtml\" target=\"_blank\">军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大</a>,\n",
       " <a href=\"https://finance.sina.com.cn/roll/2020-11-19/doc-iiznctke2194128.shtml\" target=\"_blank\">国常会再提促进家电消费：家电股迎政策红利 两条主线布局</a>,\n",
       " <a href=\"https://finance.sina.com.cn/stock/hyyj/2020-11-19/doc-iiznezxs2611976.shtml\" target=\"_blank\">涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了</a>]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all(href=re.compile(\"2020-11-19\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">2</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=3\">3</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=4\">4</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=5\">5</a>,\n",
       " <a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">下一页</a>]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all(href=re.compile(\"page\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question\n",
    "\n",
    "How can we find out the largest page number?"
   ]
  },
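  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible approach, sketched under assumptions: collect the pagination links whose `href` contains `page=`, extract the number from each, and take the maximum. The snippet below reuses a shortened copy of the pagination markup printed earlier; the names `page_pattern`, `pages`, and `max_page` are introduced here for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Shortened copy of the pagination markup from the sample page\n",
    "html = '''\n",
    "<span class=\"pagebox\">\n",
    "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">2</a>\n",
    "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=3\">3</a>\n",
    "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=4\">4</a>\n",
    "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=5\">5</a>\n",
    "<a href=\"http://finance.sina.com.cn/roll/index.d.html?cid=56588&amp;page=2\">下一页</a>\n",
    "</span>\n",
    "'''\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "\n",
    "# The same pattern filters the links and extracts the page number\n",
    "page_pattern = re.compile(r'page=(\\d+)')\n",
    "pages = [int(page_pattern.search(a['href']).group(1))\n",
    "         for a in soup.find_all('a', href=page_pattern)]\n",
    "max_page = max(pages)\n",
    "print(max_page)\n",
    "```\n",
    "\n",
    "Note that this only yields the largest page linked from the current page, which is not necessarily the last page of the whole listing."
   ]
  },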
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2020年11月20日涨停板早知道：七大利好有望发酵',\n",
       " '军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大',\n",
       " '涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了',\n",
       " '2020年11月19日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月18日涨停板早知道：七大利好有望发酵',\n",
       " '10月装车辆同比翻倍：磷酸铁锂强势回归 龙头股价迭创新高(股)',\n",
       " '2020年11月17日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月16日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月13日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月12日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月11日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月10日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月9日涨停板早知道：七大利好有望发酵']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all(text=[re.compile(\"利好\"), re.compile(\"新高\"), re.compile(\"扩大\")])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
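    "The `limit` argument: `find_all()` returns every match, which can be slow when the document tree is large. If you do not need all the results, pass `limit` to cap the number returned; the search stops as soon as that many matches have been found."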
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2020年11月20日涨停板早知道：七大利好有望发酵',\n",
       " '军工股午后崛起：航空产业链业绩提升 订单量增速有望扩大',\n",
       " '涨价题材火爆：有机硅价格创年内新高 最全概念股名单来了',\n",
       " '2020年11月19日涨停板早知道：七大利好有望发酵',\n",
       " '2020年11月18日涨停板早知道：七大利好有望发酵']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.find_all(text=[re.compile(\"利好\"), re.compile(\"新高\"), re.compile(\"扩大\")], limit = 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "Learn to use `.find`, `.find_parent`, `.find_parents`, `.find_next_siblings`, `.find_next_sibling`, `.find_next`, and the other search methods.\n",
    "\n",
    "See the documentation for details: https://beautifulsoup.readthedocs.io/zh_CN/latest/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
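    "#### Trying a live crawl\n",
    "\n",
    "Instead of opening a local file, fetch the page from an HTTP URL and pass the response text to BeautifulSoup."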
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# A data page from NetEase (163.com)\n",
    "url = \"http://quotes.money.163.com/data/caibao/yjgl_ALL.html?reportdate=20220930&sort=publishdate&order=desc&page=0\"\n",
    "\n",
    "def request_url(url):\n",
    "    \"\"\"Fetch url with a browser-like User-Agent and return the body decoded as UTF-8.\"\"\"\n",
    "    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'\n",
    "    headers = {'User-Agent': user_agent}\n",
    "\n",
    "    # timeout keeps the request from hanging indefinitely\n",
    "    res = requests.get(url, headers=headers, timeout=10)\n",
    "    res.encoding = 'utf-8'\n",
    "    return res.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
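    "soup = BeautifulSoup(request_url(url), 'lxml')\n",
    "# Locate the results table by its CSS class\n",
    "content = soup.find('table', class_='fn_cm_table')\n",
    "\n",
    "# Text of every link in the table\n",
    "tmp = [i.text for i in content.find_all('a')]\n",
    "\n",
    "# The links appear to repeat in groups of three per row: stock code, company name, then a third link\n",
    "code, names = tmp[0::3], tmp[1::3]"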
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['杰华特',\n",
       " '珠城科技',\n",
       " '欧克科技',\n",
       " '星源卓镁',\n",
       " '华新环保',\n",
       " '丰立智能',\n",
       " '聚和材料',\n",
       " '鼎泰高科',\n",
       " '美腾科技',\n",
       " '矩阵股份',\n",
       " '尚太科技',\n",
       " '源杰科技',\n",
       " '长盈通',\n",
       " '锐捷网络',\n",
       " '美埃科技',\n",
       " '天元宠物',\n",
       " '云中马',\n",
       " '微导纳米',\n",
       " '甬矽电子',\n",
       " '永顺泰',\n",
       " '众智科技',\n",
       " '卡莱特',\n",
       " '诺诚健华',\n",
       " '炜冈科技',\n",
       " '首创证券']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['星源卓镁',\n",
       " '华新环保',\n",
       " '丰立智能',\n",
       " '聚和材料',\n",
       " '鼎泰高科',\n",
       " '美腾科技',\n",
       " '矩阵股份',\n",
       " '尚太科技',\n",
       " '长盈通',\n",
       " '锐捷网络',\n",
       " '美埃科技',\n",
       " '天元宠物',\n",
       " '云中马',\n",
       " '甬矽电子',\n",
       " '永顺泰',\n",
       " '众智科技',\n",
       " '卡莱特',\n",
       " '诺诚健华',\n",
       " '炜冈科技',\n",
       " '首创证券',\n",
       " '百济神州',\n",
       " '天振股份',\n",
       " '昆船智能',\n",
       " '中芯国际',\n",
       " '三未信安']"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
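    "# Links whose href contains '#11a01' appear to come in pairs (stock code, then company name); keep every second one\n",
    "names = soup.find_all('a', href=re.compile('#11a01'))\n",
    "[n.text for n in names][1::2]"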
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['新浪简介',\n",
       " 'About Sina',\n",
       " '广告服务',\n",
       " '招聘信息',\n",
       " '网站律师',\n",
       " 'SINA English',\n",
       " '会员注册',\n",
       " '产品答疑',\n",
       " '版权所有']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Sina Finance news\n",
    "pagenum = 0\n",
    "url = 'https://finance.sina.com.cn/roll/index.d.html?cid=56588&page=' + str(pagenum)\n",
    "soup = BeautifulSoup(request_url(url), 'lxml')\n",
    "\n",
    "# Combine a tag name with a class filter. class is a reserved word in Python, so the keyword argument is spelled class_\n",
    "# [a['href'] for a in soup.find_all('a', class_='sinatail')]\n",
    "[a.text for a in soup.find_all('a', class_='sinatail')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
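    "### Assignment 1\n",
    "\n",
    "Note the `page=1` parameter in the URL. Change it programmatically to step through the listing and scrape several pages of results."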
   ]
  },
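  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint, the URLs for successive pages can be generated rather than typed by hand. The sketch below only builds the URLs; `BASE_URL`, `page_url`, and the page range 1 to 3 are assumptions made for illustration, and whether the site numbers its pages from 0 or 1 should be checked against the real page.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urlencode\n",
    "\n",
    "# Listing address taken from the Sina Finance example above\n",
    "BASE_URL = 'https://finance.sina.com.cn/roll/index.d.html'\n",
    "\n",
    "def page_url(page, cid=56588):\n",
    "    \"\"\"Build the listing URL for one page number (hypothetical helper).\"\"\"\n",
    "    return BASE_URL + '?' + urlencode({'cid': cid, 'page': page})\n",
    "\n",
    "# Assumed range: pages 1 to 3\n",
    "urls = [page_url(p) for p in range(1, 4)]\n",
    "for u in urls:\n",
    "    print(u)\n",
    "```\n",
    "\n",
    "Each URL can then be passed to `request_url()` and parsed as before. Pausing briefly between requests, for example with `time.sleep`, keeps the load on the server low."
   ]
  },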
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
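    "### XPath\n",
    "\n",
    "XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for searching XML, but it works equally well on HTML documents.\n",
    "\n",
    "So when writing a crawler, we can use XPath to extract the information we need.\n",
    "\n",
    "XPath is very expressive. It offers concise path expressions for selecting nodes, and the XPath standard also defines more than 100 built-in functions for matching strings, numbers, and dates and times and for processing nodes and sequences (that count refers to XPath 2.0 and later; lxml implements XPath 1.0, which has a smaller set). Almost any node we want to locate can be selected with XPath.\n"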
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
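    "The table below lists a few commonly used rules:\n",
    "\n",
    "| Expression | Description |\n",
    "| --- | --- |\n",
    "| `nodename` | Selects the child elements of the current node named `nodename` |\n",
    "| `/` | Selects direct children of the current node (at the start of an expression, starts from the document root) |\n",
    "| `//` | Selects descendants of the current node at any depth |\n",
    "| `.` | Selects the current node |\n",
    "| `..` | Selects the parent of the current node |\n",
    "| `@` | Selects attributes |\n",
    "\n",
    "In short, `/` selects direct children, `//` selects all descendants, `.` selects the current node, `..` selects its parent, and `@` refers to an attribute, either to select it or, inside a predicate such as `[@id=\"x\"]`, to keep only the nodes whose attribute matches.\n",
    "\n",
    "Chrome and Firefox make writing these expressions much easier: right-click an element, choose Inspect, then use the Copy XPath option in the developer tools to get an expression for that element."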
   ]
  },
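  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rules in the table can be tried on a small document before moving to a real page. This is a minimal sketch: the HTML string and variable names are invented for illustration, and it assumes `lxml` is installed.\n",
    "\n",
    "```python\n",
    "from lxml import etree\n",
    "\n",
    "# Hypothetical document, used only to exercise the rules above\n",
    "html = ('<div id=\"box\"><ul>'\n",
    "        '<li><a href=\"/a\">A</a></li>'\n",
    "        '<li><a href=\"/b\">B</a></li>'\n",
    "        '</ul></div>')\n",
    "root = etree.HTML(html)\n",
    "\n",
    "# // with an attribute predicate: every a element below the div whose id is \"box\"\n",
    "texts = [a.text for a in root.xpath('//div[@id=\"box\"]//a')]\n",
    "# @ selects the attribute values themselves\n",
    "hrefs = root.xpath('//a/@href')\n",
    "# / selects direct children, and [1] keeps the first li\n",
    "first_link = root.xpath('//ul/li[1]/a')[0]\n",
    "# .. moves from the link up to its parent element\n",
    "parent_tag = first_link.xpath('..')[0].tag\n",
    "\n",
    "print(texts, hrefs, parent_tag)\n",
    "```\n",
    "\n",
    "Expressions can also be evaluated relative to an element, as in the last step, which is convenient when each table row needs several fields extracted."
   ]
  },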
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'688035'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
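    "import requests\n",
    "from lxml import etree\n",
    "\n",
    "# request_url() is the helper defined in an earlier cell\n",
    "url = 'http://quotes.money.163.com/data/caibao/yjgl_ALL.html?reportdate=20200930&sort=publishdate&order=desc&page=0'\n",
    "\n",
    "# XPath copied from the browser:\n",
    "# //*[@id=\"plate_performance\"]/tbody/tr[1]/td[2]/a\n",
    "# The browser's DOM contains a <tbody> that is apparently absent from the raw HTML, so it is dropped from the expression below.\n",
    "\n",
    "selector = etree.HTML(request_url(url))\n",
    "selector.xpath('//*[@id=\"plate_performance\"]/tr[1]/td[2]/a')[0].text"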
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element a at 0x110c2bd40>,\n",
       " <Element a at 0x110dc4280>,\n",
       " <Element a at 0x110dc4340>,\n",
       " <Element a at 0x110dc4580>,\n",
       " <Element a at 0x110dc4040>,\n",
       " <Element a at 0x110bf6dc0>,\n",
       " <Element a at 0x110dc1e40>,\n",
       " <Element a at 0x110dc1c80>,\n",
       " <Element a at 0x110dc1b40>,\n",
       " <Element a at 0x110bf6f80>,\n",
       " <Element a at 0x110dc1a40>,\n",
       " <Element a at 0x110dc1940>,\n",
       " <Element a at 0x110dc1800>,\n",
       " <Element a at 0x110dc16c0>,\n",
       " <Element a at 0x110dc17c0>,\n",
       " <Element a at 0x110dc13c0>,\n",
       " <Element a at 0x110dc1e80>,\n",
       " <Element a at 0x110dc1f00>,\n",
       " <Element a at 0x110dc1d40>,\n",
       " <Element a at 0x110dc1d80>,\n",
       " <Element a at 0x110dc1b80>,\n",
       " <Element a at 0x110dc1c40>,\n",
       " <Element a at 0x110dc1a80>,\n",
       " <Element a at 0x110dc1b00>,\n",
       " <Element a at 0x110dc1980>]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Drop the row index (tr[1] -> tr) so the expression matches every row\n",
    "pstring = '//*[@id=\"plate_performance\"]/tr/td[2]/a'\n",
    "selector.xpath(pstring)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['688035',\n",
       " '688306',\n",
       " '301209',\n",
       " '688375',\n",
       " '688163',\n",
       " '301283',\n",
       " '301282',\n",
       " '301270',\n",
       " '688353',\n",
       " '688253',\n",
       " '688320',\n",
       " '301156',\n",
       " '603215',\n",
       " '688102',\n",
       " '301356',\n",
       " '301121',\n",
       " '301276',\n",
       " '688380',\n",
       " '688273',\n",
       " '688237',\n",
       " '001313',\n",
       " '301312',\n",
       " '301233',\n",
       " '603102',\n",
       " '301115']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[x.text for x in selector.xpath(pstring)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'194'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
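   ],
   "source": [
    "# Find the last page number\n",
    "# Absolute XPath copied from the browser; a[7] assumes the last-page link is the seventh <a> in the pager:\n",
    "# /html/body/div[1]/div[4]/div[3]/div[2]/div[2]/div/a[7]\n",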
    "pstring = '/html/body/div[1]/div[4]/div[3]/div[2]/div[2]/div/a[7]'\n",
    "selector.xpath(pstring)[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2022年12月6日涨停板早知道：七大利好有望发酵',\n",
       " '12月5日沪深两市涨停分析：正泰电器入主 通润装备走出10连板',\n",
       " '2022年12月5日涨停板早知道：七大利好有望发酵',\n",
       " '12月2日沪深两市涨停分析：通润装备走出9连板 安奈儿实现7连板',\n",
       " '2022年12月2日涨停板早知道：七大利好有望发酵']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Back to the news example\n",
    "\n",
    "url = 'https://finance.sina.com.cn/roll/index.d.html?cid=56588&page=1'\n",
    "selector = etree.HTML(request_url(url))\n",
    "\n",
    "# XPaths copied for two adjacent headlines differ only in the li index:\n",
    "#   //*[@id=\"Main\"]/div[3]/ul[1]/li[2]/a\n",
    "#   //*[@id=\"Main\"]/div[3]/ul[1]/li[1]/a\n",
    "# Dropping the index matches every li in the first list.\n",
    "pstring = '//*[@id=\"Main\"]/div[3]/ul[1]/li/a'\n",
    "\n",
    "[x.text for x in selector.xpath(pstring)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
